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DEPENDENT MULTINOMIAL MODELS MADE EASY: 

STICK BREAKING WITH THE POLYA-GAMMA AUGMENTATION 


By Scott W. Linderman,* Matthew J. Johnson,* and Ryan P. Adams 

Harvard University 

Many practical modeling problems involve discrete data that are 
best represented as draws from multinomial or categorical distribu¬ 
tions. For example, nucleotides in a DNA sequence, children’s names 
in a given state and year, and text documents are all commonly mod¬ 
eled with multinomial distributions. In all of these cases, we expect 
some form of dependency between the draws: the nucleotide at one 
position in the DNA strand may depend on the preceding nucleotides, 
children’s names are highly correlated from year to year, and topics 
in text may be correlated and dynamic. These dependencies are not 
naturally captured by the typical Dirichlet-multinomial formulation. 

Here, we leverage a logistic stick-breaking representation and recent 
innovations in Polya-gamma augmentation to reformulate the multi¬ 
nomial distribution in terms of latent variables with jointly Gaussian 
likelihoods, enabling us to take advantage of a host of Bayesian in¬ 
ference techniques for Gaussian models with minimal overhead. 


1. Introduction. It is often desirable to model discrete data in terms of continuous latent 
structure. In applications involving text corpora, discrete-valued time series, or polling and pur¬ 
chasing decisions, we may want to learn correlations or spatiotemporal dynamics, and leverage 
these learned structures to improve inferences and predictions. However, adding these continuous 
latent dependence structures often comes at the cost of significantly complicating inference: such 
models may require specialized, one-off inference algorithms, such as a nonconjugate variational 
optimization, or they may only admit very general inference tools like particle MCMC (Andrieu 
et al., 2010) or elliptical slice sampling (Murray et ah, 2010), which can be inefficient and difficult 
to scale. Developing, extending, and applying these models has remained a challenge. 

In this paper we aim to provide a class of such models that are easy and efficient. We develop 
models for categorical and multinomial data in which dependencies among the multinomial param¬ 
eters are modeled via latent Gaussian distributions or Gaussian processes, and we show that this 
flexible class of models admits a simple auxiliary variable method that makes inference easy, fast, 
and modular. This construction not only makes these models simple to develop and apply, but also 
allows the resulting inference methods to use off-the-shelf algorithms and software for Gaussian 
processes and linear Gaussian dynamical systems. 

The paper is organized as follows. After providing background material and defining our general 
models and inference methods, we demonstrate the utility of this class of models by applying it to 
three domains as case studies. First, we develop a correlated topic model for text corpora. Second, 
we study an application to modeling the spatial and temporal patterns in birth names given only 
sparse data. Finally, we provide a new continuous state-space model for discrete-valued sequences, 
including text and human DNA. In each case, given our model construction and auxiliary variable 
method, inference algorithms are easy to develop and very effective in experiments. We conclude 
with comments on the new kinds of models these methods may enable. 
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Code to use these models and reproduce all the figures is available at https: //github. com/ 
HIPS/pgmult. 

2. Modeling correlations in multinomial parameters. In this section, we discuss an auxil¬ 
iary variable scheme that allows multinomial observations to appear as Gaussian likelihoods within 
a larger probabilistic model. The key trick discnssed in the proceeding sections is to introduce 
Polya-gamma random variables into the joint distribution over data and parameters in such a way 
that the resulting marginal leaves the original model intact. 

2.1. Polya-gamma augmentation. The integral identity at the heart of the Polya-gamma aug¬ 
mentation scheme (Poison et ah, 2013) is 

( 1 ) = 

where k = a — b/2 and p{oj \ 6,0) is the density of the Polya-gamma distribution PG(6,0), which 
does not depend on 'll;. Gonsider a likelihood function of the form 

( Ip\a{x) 

( 2 ) P{x\^) = c{x)j^^^ 

for some functions a, b, and c. Such likelihoods arise, e.g., in logistic regression and in binomial and 
negative binomial regression (Poison et ah, 2013). Using (1) along with a prior p{^), we can write 
the joint density of {tp,x) as 

lp'ip\a{x) poo 

(3) pi'ip^x) p(V’) c(x) /^p(a; | 6(x), 0) dw. 

The integrand of (3) defines a joint density on (V', x, lo) which admits pltp, x) as a marginal density. 
Conditioned on these auxiliary variables u, we have 

(4) p(V> I X, w) oc 

which is Gaussian when p{'ijj) is Gaussian. Furthermore, by the exponential tilting property of the 
Polya-gamma distribution, we have a; |'j/^, x ~ PG(6(x), V’)- Thus the identity (1) gives rise to a 
conditionally conjugate augmentation scheme for Gaussian priors and likelihoods of the form (2). 

This augmentation scheme has been used to develop Gibbs sampling and variational inference 
algorithms for Bernoulli, binomial (Poison et ah, 2013), and negative binomial regression mod¬ 
els (Zhou et ah, 2012) with logit link functions. In this paper we extend it to the multinomial 
distribution. 

2.2. A Polya-gamma augmentation for the multinomial distribution. To develop a Polya-gamma 
augmentation for the multinomial, we first rewrite the ih-dimensional multinomial density recur¬ 
sively in terms of K — 1 binomial densities: 


K-l 

(5) Mult(x I A^, tt) = Bin(xfc | iVfc, ^k), 

k=l 

(6) Nk = N-'^Xj, Wfc = -- ^ --, k = 2,3,...,K, 

j<k l^j<k 
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Fig 


I: Correlated 2D Gaussian priors 


on xj) and their implied densities on 7rsB(’0)- See text for details. 


where Ni = N = and tti = tti. For convenience, we define N{x) = [A^i,..., Nk-i]- This de¬ 

composition of the multinomial density is a “stick-breaking” representation where each vr^ represents 
the fraction of the remaining probability mass assigned to the k-th. component. We let nk = cr{'ijjk) 
and define the function, ttsb : —>■ [0,1]^, which maps a vector xp to a normalized probability 

vector TT. 

Next, we rewrite the density into the form required by (1) by substituting a{xjjj.) for 

/ AT \ 

Mult{x\ N,xIt) = Bin(xfc I iVfc,o-(V'fc)) = ^ 

k=i k=i 

M (1 + ■ 


(7) 

( 8 ) 


Choosing ak{x) = Xk and bk{x) = Nk for each fc = l,2,...,iC — 1, we can then introduce Polya- 
gamma auxiliary variables ujk corresponding to each coordinate xpk] dropping terms that do not 
depend on xp and completing the square yields 

K-l 

p{x, Uj\xp) (X n 

k=l 

where ft = diag(a;) and k{x) = x — N{x)/2. That is, conditioned on cj, the likelihood of xp under 
the augmented multinomial model is proportional to a diagonal Gaussian distribution. 

Figure 1 illustrates how a variety of Gaussian densities map to probability densities on the sim¬ 
plex. Correlated Gaussians (left) put most probability mass near the vri = 7r2 axis of the simplex, and 
anti-correlated Gaussians (center) put mass along the sides where vri is large when 7r2 is small and 
vice-versa. Finally, a nearly spherical Gaussian approximates a symmetric Dirichlet distribution. 


xp 


ft ^n{x), ri ^), 
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Fig 2: A comparison of correlated topic model performance. The left panel shows a subset of the inferred topic 
correlations for the AP News corpus. Two examples are highlighted: a) positive correlation between topics 
{house, committee, congress, law) and {Bush, Dukakis, president, campaign), and b) anticorrelation between 
{percent, year, billion, rate) and {court, case, attorney, judge). The middle and right panels demonstrate the 
efficacy of our SB-CTM relative to competing models on the AP News corpus and the 20 Newsgroup corpus, 
respectively. 


Appendix A derives a closed-form expression for the density on tt implied by a Gaussian distribu¬ 
tion on ■?/>, as well an expression for a diagonal Guassian distribution that best approximates, in a 
moment-matching sense, a Dirichlet distribution on tt. 

2.3. Alternative models. This stick breaking transformation has been explored in the previous 
work of Ren et ah (2011) and Khan et al. (2012), but has not been connected with the Polya-gamma 
augmentation. The multinomial probit and logistic normal methods are most commonly used, and 
we describe them here. 

The multinomial probit model (Albert and Chib, 1993) applies to categorical regrssion, and is 
based upon the following auxiliary variable model: Zk = + ^k), where ‘h is the probit function 

and Cfc ~ AA(0,1). Given these auxiliary variables, Xk = 1 A Zk > zj'ij ^ k, and Xk = 0 otherwise. 
This approach has primarily focused on categorical modeling. 

A common method of modeling correlated multinomial parameters, to which we directly compare 
in the following sections, is based on the “softmax” or multi-class logistic function, vr^ = e^'^ / 

Let ttln : t [0,1]^ denote the joint transformation. This has found application in multiclass 

regression (Holmes et ah, 2006) and is the common approach to correlated topic modeling (Blei 
and Lafferty, 2006a). However, leveraging this transformation in conjunction with Gaussian models 
for r/j is challenging due to the lack of conjugacy, and previous work has relied upon variational 
approximations to tackle the hard inference problem (Blei and Lafferty, 2006a). 

Unlike the logistic normal and multinomial probit, the stick-breaking transformation we employ 
is asymmetric. Our illustrations in Figure 1 and the discussion in Appendix A show that this lack 
of symmetry does not impair the representational capacity of the model. 

3. Correlated topic models. The Latent Dirichlet Allocation (LDA) (Blei et ah, 2003) is 
a popular model for learning topics from text corpora. The Correlated Topic Model (CTM) (Blei 
and Lafferty, 2006a) extends LDA by including a Gaussian correlation structure among topics. 
This correlation model is powerful not only because it reveals correlations among topics but also 
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because inferring such correlations can significantly improve predictions, especially when inferring 
the remaining words in a document after only a few have been revealed (Blei and Lafferty, 2006a). 
However, the addition of this Gaussian correlation structure breaks the Dirichlet-Multinomial con- 
jugacy of LDA, making estimation and particularly Bayesian inference and model-averaged predic¬ 
tions more challenging. An approximate maximum likelihood approach using variational EM (Blei 
and Lafferty, 2006a) is often effective, but a fully Bayesian approach which integrates out param¬ 
eters may be preferable, especially when making predictions based on a small number of revealed 
words in a document. A recent Bayesian approach based on a Polya-Gamma augmentation to the 
logistic normal CTM (LN-CTM) (Chen et ah, 2013) provides a Gibbs sampling algorithm with 
conjugate updates, but the Gibbs updates are limited to single-site resampling of one scalar at a 
time, which can lead to slow mixing in correlated models. 

In this section we show that MCMC sampling in a correlated topic model based on the stick 
breaking construction (SB-CTM) can be significantly more efficient than sampling in the LN-CTM 
while maintaining the same integration advantage over EM. 

In the standard LDA model, each topic j3^ (t = 1, 2,..., T) is a distribution over a vocabulary of 
V possible words, and each document d has a distribution over topics 0^ (d = 1, 2,..., D). The nth 
word in document d is denoted Wn,d for d = 1, 2,..., N^. When each (3^ and Od is given a symmetric 
Dirichlet prior with concentration parameters and ag, respectively, the full generative model is 

( 10 ) ~ Dir(a;/ 3 ), 0^ ~ Dir(a 0 ), Zn,d\ 6d ^ Cat{6d), Wn,d\ Zn,d,{f3t} ^ Cat 

The CTM replaces the Dirichlet prior on each 0^ with a new prior that models the coordinates of Od 
as mutually correlated. This correlation structure on 6d is induced by first sampling a correlated 
Gaussian vector and then applying the logistic normal map: 

( 11 ) ^d ='^LNi'iPd) 

where the Gaussian parameters (/x, El) can be given a conjugate normal-inverse Wishart (NIW) 
prior. Analogously, our SB-CTM generates the correlation structure by instead applying the stick¬ 
breaking logistic map: 

(12) V’dI = 7rSB(V’d)- 

The goal is then to infer the posterior distribution over the topics /3j, the documents’ topic allo¬ 
cations and their mean and correlation structure (//, S). (In the case of the EM algorithm of 
(Blei and Lafferty, 2006a), the task is to approximate maximum likelihood estimates of the same 
parameters.) Modeling correlation structure within the topics [3 can be done analogously. 

Eor fully Bayesian MCMC inference in the SB-CTM, we develop a Gibbs sampler that exploits 
the block conditional Gaussian structure provided by the stick-breaking construction. The Gibbs 
sampler iteratively samples z\w,P,^p; I3\ z,w; ip\z, /i, E), u; and /x, El 1 1 /); as well as the auxil¬ 
iary variables u\'ip,z. The first two are standard updates for LDA models, so we focus on the latter 
three. Using the identities derived in Section 2.2, the conditional density of each •0^ | Zd,fi,'S,uj 
can be written 

(13) p{^a\zd,u:d) oc M{'ip\K{cd),fl^^) oc 

where 

(14) ^ = S [^(crf)-L V] , S = [fJrf-L Cd,t = '^^Zn,d = t], fld = diag{ud), 
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and so it is resampled as a joint Gaussian. The correlation structure parameters fi and S with a 
conjugate NIW prior are sampled from their conditional NIW distribution. Finally, the auxiliary 
variables u are sampled as Polya-Gamma random variables, with \ z^, "0^ PG(iV(cd),0J. A 
feature of the stick-breaking construction is that the the auxiliary variable update can be performed 
in an embarrassingly parallel computation. 

We compare the performance of this Gibbs sampling algorithm for the SB-CTM to the Gibbs 
sampling algorithm of the LN-CTM (Ghen et ah, 2013), which uses a different Polya-gamma aug¬ 
mentation, as well as the original variational EM algorithm for the GTM and collapsed Gibbs 
sampling in standard LDA. Figure 2 shows results on both the AP News dataset and the 20 News- 
groups dataset, where models were trained on a random subset of 95% of the complete documents 
and tested on the remaining 5% by estimating held-out likelihoods of half the words given the 
other half. The collapsed Gibbs sampler for LDA is fast but because it does not model correlations 
its ability to predict is significantly constrained. The variational EM algorithm for the GTM is 
reasonably fast but its point estimate doesn’t quite match the performance from integrating out 
parameters via MCMG in this setting. The LN-GTM Gibbs sampler continues to improve slowly 
but is limited by its single-site updates, while the SB-GTM sampler seems to both mix effectively 
and execute efficiently due to its block Gaussian updating. 

The SB-CTM demonstrates that the stick-breaking construction and corresponding Polya-Gamma 
augmentation makes inference in correlated topic models both easy to implement and computa¬ 
tionally efficient. The block conditional Gaussianity also makes inference algorithms modular and 
compositional: the construction immediately extends to dynamic topic models (DTMs) (Blei and 
Lafferty, 2006b), in which the latent 0^ evolve according to linear Gaussian dynamics, and inference 
can be implemented simply by applying off-the-shelf code for Gaussian linear dynamical systems 
(see Section 5). Finally, because LDA is so commonly used as a component of other models (e.g. 
for images (Wang and Grimson, 2008)), easy, effective, modular inference for GTMs and DTMs is 
a promising general tool. 

To apply correlated topic models to increasingly large datasets, a stochastic variational inference 
(SVI) (Hoffman et ah, 2013) approach is promising. In Appendix C, we show that the stick-breaking 
construction enables an algorithm based on the Polya-gamma augmentation that can work with 
subsets, or mini-batches, of data in each iteration. As with the Gibbs sampler, the conditionally 
conjugate structure makes the algorithm easy to derive and implement. 

4. Gaussian processes with multinomial observations. Gonsider the United States cen¬ 
sus data, which lists the first names of children born in each state for the years 1910-2013. Suppose 
we wish to predict the probability of a particular name in New York State in the years 2012 and 
2013 given observed names in earlier years. We might reasonably expect that name probabilities 
vary smoothly over time as names rise and fall in popularity, and that name probability would be 
similar in neighboring states. A Gaussian process naturally captures these prior intuitions about 
spatiotemporal correlations, but the observed name counts are most natnrally modeled as multino¬ 
mial draws from latent probability distributions over names for each combination of state and year. 
We show how efficient inference can be performed in this otherwise difficult model by leveraging 
the Polya-gamma augmentation. 

Let Z G denote the matrix of D dimensional inputs and X 

served K dimensional connt vectors for each input. In our example, each row Zm of Z corresponds 
to the year, latitude, and longitude of an observation, and K is the number of names. Underlying 
these observations we introduce a set of latent variables, 0m,fc such that the probability vector at 
input Zm is TTm = :)■ ^he auxiliary variables for the k-th name, 0.^, are linked via a 

Gaussian process with covariance matrix, C, whose entry Ctj is the covariance between input 2 :^ 
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Fig 3: A spatiotemporal Gaussian process applied to the names of children born in the United States from 
1960-2013. With a limited dataset of only 50 observations per state/year, the stick breaking and logistic 
normal multinomial GPs (SBM-GP and LNM-GP) outperform naive approaches in predicting the top and 
bottom 10 names (bottom left, parentheses: std. error). Our SBM-GP, which leverages the Polya-gamma 
augmentation, is considerably more efficient than the non conjugate LNM-GP (bottom right). 


and Zj under the GP prior, and mean vector The covariance matrix is shared by all names, 
and the mean is empirically set to match the measured name probability. The full model is then, 

for A:G{l,...,iA-l} 

Xm ~ Mult(Wm, 7rsB('0m,:)) for m G {1,..., M}. 

To perform inference, introduce auxiliary Polya-gamma variables, for each 'iprn,k- Condi¬ 
tioned on these variables, the conditional distribution of ^|y. ^ is. 


I Z,X,u:,n,C ocM [ tp, 


n 


k ^i^{X-.,k), ^k ^{^■.,k I C)(xM (xp,^k I ^^k, 

Sfc = (c-i + /ifc = Sfc (c-Vfc + <x,^k)) , 


where = diag(a;:^fc). The auxiliary variables are updated according to their conditional distri¬ 
bution: ljJra,k I 'Pm,k ~ PG(iVm,fc, where Nm,k = Xm,j- 

Figure 3 illustrates the power of this approach on U.S. census data. The top two plots show the 
inferred probabilities under our stick-breaking multinomial GP model for the full dataset. Interest¬ 
ing spatiotemporal correlations in name probability are uncovered. In this large-count regime, the 
posterior uncertainty is negligible since we observe thousands of names per state and year, and sim¬ 
ply modeling the transformed empirical probabilities with a GP works well. However, in the sparse 
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- SBM-LDS (Gibbs) • HMM (Gibbs) — Raw LDS (Gibbs) - LNM-LDS (pMCMC) 

Fig 4: Predictive log likelihood comparison of time series models with multinomial observations. 

data regime with only N^a = 50 observations per input, it greatly improves performance to model 
uncertainty in the latent probabilities using a Gaussian process with multinomial observations. 

The bottom panels compare four methods of predicting future names in the years 2012 and 
2013 for a down-sampled dataset with N,^ = 50: predicting based on the empirical probability 
measured in 2011; a standard GP to the empirical probabilities transformed by 7r^g (Raw GP); 
a GP whose outputs are transformed by the logistic normal function, vtln, to obtain multinomial 
probabilities (LNM GP) fit using elliptical slice sampling (Murray et ah, 2010); and our stick¬ 
breaking multinomial GP (SBM GP). In terms of ability to predict the top and bottom 10 names, 
the multinomial models are both comparable and vastly superior to the naive approaches. 

In terms of efficiency, our model is considerably faster than the logistic normal version, as shown 
in the bottom right panel. This difference is due to two reasons. First, our augmented Gibbs sampler 
is more efficient than the elliptical slice sampling algorithm used to handle the nonconjugacy in 
the LNM GP. Second, and perhaps most important, we are able to make collapsed predictions in 
which we compute the predictive distribution test '0’s given u, integrating out the training 0. In 
contrast, the LNM-GP must condition on the training GP values in order to make predictions, and 
effectively integrate over training samples using MCMG. Appendix B goes into greater detail on 
how marginal predictions are computed and why they are more efficient than predicting conditioned 
on a single value of 0. 

5. Multinomial linear dynamical systems. While discrete-state hidden Markov models 
(HMMs) are ubiquitous for modeling time series and sequence data, it can be preferable to use 
a continuous state space model. In particular, while discrete states have no intrinsic structure, 
continuous states can correspond to a natural embedding or geometry (Belanger and Kakade, 
2015). These considerations are particularly relevant to text, where word embeddings (Collobert 
and Weston, 2008) have proven to be a powerful tool. 

Gaussian linear dynamical systems (LDS) provide very efficient learning and inference algorithms, 
but they can typically only be applied when the observations are themselves linear with Gaussian 
noise. While it is possible to apply a Gaussian LDS to count vectors (Belanger and Kakade, 2015), 
the resulting model is misspecified in the sense that, as a continuous density, the model assigns 
zero probability to training and test data. However, Belanger and Kakade (2015) show that this 
model can still be used for several machine learning tasks with compelling performance, and that 
the efficient algorithms afforded by the misspecified Gaussian assumptions confer a significant 
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computational advantage. Indeed, the authors have observed that such a Gaussian model is “worth 
exploring, since multinomial models with softmax link functions prevent closed-form M step updates 
and reqnire expensive” computations (Belanger and Kakade, 2014); this paper aims to help bridge 
precisely this gap and enable efficient Gaussian LDS computational methods to be applied while 
maintaining multinomial emissions and an asymptotically unbiased representation of the posteiror. 
While there are other approximation schemes that effectively extend some of the benefits of LDSs to 
nonlinear, non-Gaussian settings, such as the extended Kalman filter (EKF) and nnscented Kalman 
filter (UKF) (Wan and Van Der Merwe, 2000; Thrun et ah, 2005), these methods do not allow for 
asymptotically unbiased Bayesian inference and can have complex behavior. Alternatively, particle 
MCMG (pMGMC) (Andrieu et ah, 2010) with ancestor resampling (Lindsten et ah, 2012) is a very 
powerful algorithm that provides unbiased Bayesian inference for very general state space models, 
but it does not enjoy the efficient block updates or conjugacy of LDSs or HMMs. 

In this section we show how to use the stick-breaking construction and its Polya-gamma aug¬ 
mentation to perform efficient inference in LDS with multinomial emissions. We focus on a Gibbs 
sampler with fully conjugate updates that utilizes standard LDS message passing for efficient block 
updates. 

The stick-breaking multinomial linear dynamical system (SBM-LDS) generates states according 
to a standard linear Gaussian dynamical system but generates multinomial observations using the 
stick-breaking logistic map: 

zo|/^o>^o ~ AA(/Xo,So), Zt\zt-i,A,B ~ W(Azt_i,S), Xt\zt,C Mult(W, 7rsB(C'zt)), 

where zt G is the system state at time t and xt G are the mnltinomial observations. 
We suppress notation for conditioning on A, B, C, /.ig, and Sg, which are system parameters of 
appropriate sizes that are given conjngate priors. The logistic normal multinomial LDS (LNM-LDS) 
is defined analogously but uses ttln in place of ttsb- 

To produce a Gibbs sampler with fully conjugate updates, we augment the observations with 
Polya-gamma random variables ujt^k- As a resnlt, the conditional state sequence zi-t\<^i-.t,xi-t 
is jointly distributed according to a Gaussian LDS in which the diagonal observation potential 
at time t is Hi{xt)\Czt,ft'i~^)-, because the observation potential is diagonal, this block up¬ 

date can be performed in only 0(TD^ + TD^K) time, and so these updates can be scaled effi¬ 
ciently to large observation dimension K. Thus the state sequence can be jointly sampled using 
off-the-shelf LDS software, and the system parameters can similarly be npdated using standard 
algorithms. The only remaining update is to the auxiliary variables, which are sampled according 
to ujt\zt, C,xr^ FG{N{xt), Czt). 

We compare the SBM-LDS and the Gibbs sampling inference algorithm to three baseline meth¬ 
ods: an LNM-LDS using pMCMG and ancestor resampling for inference, an HMM using Gibbs 
sampling, and a “raw” LDS which treats the multinomial observation vectors as observations in 
as in Belanger and Kakade (2015). We examine each method’s performance on each of three 
experiments: in modeling a sequence of 682 amino acids from human DNA with 22 dimensional 
observations, a set of 20 random AP news articles with an average of 77 words per article and a 
vocabulary size of 200 words, and an excerpt of 4000 words from Lewis Carroll’s Alice’s Adventures 
in Wonderland with a vocabulary of 1000 words. We reserved the final 10 amino acids, 10 words per 
news article, and 100 words from Alice for computing predictive likelihoods. Each linear dynamical 
model had a 10-dimensional state space, while the HMM had 10 discrete states (HMMs with 20, 
30, and 40 states all performed worse on these tasks). 

Figure 4 (left panels) shows the predictive log likelihood for each method on each experiment, 
normalized by the number of counts in the test dataset and setting to zero the likelihood under 
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a multinomial model fit to the training data mean. For the DNA data, which has the smallest 
“vocabulary” size, the HMM achieves the highest predictive likelihood, but the SBM-LDS edges out 
the other LDS methods. On the two text datasets, the SBM-LDS outperforms the other methods, 
particularly in Alice where the vocabulary is larger and the document is longer. In terms of run 
time, the SBM-LDS is orders of magnitude faster than the LNM-LDS with pMCMC (right panel) 
because it mixes much more efficiently over the latent trajectories. 

The SBM-LDS is an easy but powerful linear state space model for multinomial observations. 
The Gibbs sampler leveraging the Polya-gamma augmentation appears very efficient, performing 
comparably to an optimized HMM implementation and orders of magnitude faster than a general 
pMCMC algorithm. Because the augmentation renders the states’ conditional distribution a Gaus¬ 
sian LDS, it easily interfaces with high-performance LDS software, and extending these models 
with additional structure or covariates can be similarly modular. 

6. Related Work. The stick-breaking transformation used herein was applied to categorical 
models by Khan et al. (2012), but they used local variational bound instead of the Polya-gamma 
augmentation. Their promising results corroborate our findings of improved performance using this 
transformation. Their generalized expectation-maximization algorithm is not fully Bayesian, and 
does not integrate into existing Gaussian modeling code as easily as our augmentation. 

Gonversely, Chen et al. (2013) used the Polya-gamma augmentation in conjunction with the 
logistic normal transformation for correlated topic modeling, exploiting the conditional conjugacy 
of a single entry ijjk \ with a Gaussian prior. Unlike our stick-breaking transformation which 

admits block Gibbs sampling over the entire vector simultaneously, their approach is limited to 
single-site Gibbs sampling. As shown in our correlated topic model experiments, this has dramatic 
effects on inferential performance. Moreover, it precludes analytical marginalization and integra¬ 
tion with existing Gaussian modeling algorithms. For example, it is not immediately applicable to 
inference in linear dynamical systems with multinomial observations. 

7. Conclusion. These case studies demonstrate that the stick-breaking multinomial model 
construction paired with the Polya-gamma augmentation yields a flexible class of models with 
easy, efficient, and compositional inference. In addition to making these models easy, the methods 
developed here can also enable new models for multinomial and mixed data: the latent continuous 
structures used here to model correlations and state-space structure can be leveraged to explore 
new models for interpretable feature embeddings, interacting time series, and dependence with 
other covariates. 
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APPENDIX A: TRANSFORMING BETWEEN P(-0) AND R(7r) 

Since the mapping between tt and if) is invertible, we can compute the distribution on tt that is 
implied by a Gaussian distribution on r/?. Assume if) ~ AA(/i, El). Then, 

p(7r I /i, El) = AA(7r5g^(7r) | /r, S) 


di/j 

dvr 
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Then, 


SvTi 




d'lpk 

d-TTk 


= g 


T^k 


1 ^j<k'^j ) 1 '^j<k'^3 


dj^k 
dll 


= 0. 


'j>k 


Since the Jacobian of the inverse transformation is lower diagonal, its determinant is simply the 
product of its diagonal, 

d’j/’ 


dvr 


K 

- TT 

a( \ 

1 


fc=i 


1 “ Ylj<k '^3 


K 

- TT 

\ 1 - Ei<fc 1 

~ Ylj<k '^3 

1 

“ 11 

fc=i 

^ Ylj<k ^3 1 Ylj<k 

II 

1 v^^~l 

1 - Ej=l ^3 




Thus, the final density is, 

p( 7 r I /X, S) = I /I, S) • 


K 


1 - 


j=i 




k=l L 

Now, suppose we are given a Dirichlet distribution, tt ~ Dir(7r | ct), and we wish to compute the 
density on 'ip. We have, 

dvr I 


p{'tp I a) = Dir(7rsB('0) | «) • 


d-ip 


K 


Dir(7rsB('0) | «) • 


k=i L 


7rfc(l - 


1 \-^k—l 

1 - E,=i 


where we have used the fact that the Jacobian of the inverse transformation is simply the inverse 
of the Jacobian of the forward transformation. We simply need to rewrite the Jacobian in terms 
of Ip rather than tt. Note that 1 — '^j^k '^3 length of the remaining stick and ^{'ipk) is the 

fraction of the remaining “stick” allocated to vr^. Thus, the remaining stick length is equal to, 

1 =n 

j<k j<k j<k 

Moreover, tta, = cj(V'fc)(l - J2j<k'^3) = ^i'4’k) Ilj<k^i~'^ 3 )- Thus, 

^ ^ [(^i'^k)Uj<k^i-^3)) (Uj<k^i-^3 

k=i\ 

K 


Pi'tp I «) = Dii'(7rsB(^) I «) • n 

fc=i 

K 

= Dir(7rsB('0) I «) • 


k=l 


a 


(V'fc) n 

j<k 


Expanding the Dirichlet distribution and substituting ip for vr, we conclude that, 

A-l 


-I j.». j. 

I ol) = cj(V^fe)“'= • cj(-V^fc)^f='=+i “T 


fc=i 
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Fig 5: Density and log density of | a = 1), the density on implied by a AT = 9 dimensional symmetric 
Dirichlet density on tt with parameter a = 1. 


This factorized form is unsurprising given that the Dirichlet distribution can be written as a stick¬ 
breaking product of beta distributions in the same way that the multinomial can be written as 
a product of binomials. Each term in the product above corresponds to the transformed beta 
distribution over 

Figure 6 shows the marginal densities on implied by a AT = 9 dimensional symmetric Dirichlet 
prior on tt with a = 1. The densities of V'fc become increasingly skewed for small values of k, 
but they are still well approximate by a Gaussian distribution. In order to approximate a uniform 
distribution, we numerically compute the mean and variance of these densities to set the parameters 
of a diagonal Guassian distribution. 


APPENDIX B: MARGINAL PREDIGTIONS WITH THE AUGMENTED MODEL 

One of the primary advantages offered by the Polya-gamma augmentation is the ability to make 
marginal predictions about | jc, cj, integrating out the value of ■j/’train- For example, in the 
GP multinomial regression models described in the main text, the methods were evaluated on 
the accuracy of their predictions about future name probabilities, which were functions of if’test- 
When p('0train) P('*/’test I'Strain) both Gaussian, we can integrate out the latent training 
variables in order to predict their test values. In a latent Gaussian-multinomial model, the pos¬ 
terior distribution over those latent training variables is non-Gaussian, but after Polya-gamma 
augmentation, it is rendered Gaussian. 

With the augmentation, we can write 


1 f 

Pi-fPtest / P^'^test I Strain) P('*/’train I X, dl/>train ~ P(^ I 

^ m=l 
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1 

0 

-2 

-3 


Fig 6: Marginal density, \ x) in red shading along with the ellipses of multivariate normal conditional 
distribution | x, a;) for 4 steps of the Gibbs sampler. In Gaussian models where we aim to predict on 
test data, there are substantial gains to be had from making marginal predictions of 'i/’test I integrating 
out ■j/’train- The key is that the conditional densities overlap substantially with the marginal density. 



and perform Monte Carlo integration over ld in order to compute the predictive distribution. By 
contrast, in the standard formulation we must perform Monte Carlo integration over if), 

P(V>test = PiAest I V’Sn) V’S ~ P(^train I *)• 

m=l 


Why does the augmented model confer a predictive advantage? It does not come from performing 
Monte Carlo integration over a smaller dimension since u and i/’train of the same size. Instead, 
it comes from the ability of the conjugate Gibbs sampler to efficiently mix over and u, and 
from the ability of a single sample of u to render a conditional Gaussian distribution over xp that 
captures much of the volume of the true marginal distribution. 

This latter point is illustrated in Figure 6. The red shading shows the true marginal density 
of "0 and the blue ellipses show the conditional density for a fixed value of uj. Each ellipse capture 
a significant amount of the marginal distribution, indicating that with a single sample of u we 
can integrate over a substantial amount of the uncertainty in 0. This example is only for a, K = 3 
dimensional multinomial observation, but this intuition should extend to higher dimensions in which 
the advantages of analytical integration should be more readily apparent. 

APPENDIX C: VARIATIONAL INFERENCE FOR CORRELATED TOPIC MODELS 
We use the following factorized approximation to the posterior distribution, 


Piii’d, ^d}, Wt}, {{zn,d}},tJ.0, R/3, S/3 I 


D 


Nd 




.d=l t=l 


n=l 


YlqiPt 


.t=i 


QiPe^'^e)- 
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First let’s consider the variational distribution for •0^ and From the conjugacy of the model, 
we have 

Qii’d) 

£ 0 , = [E[diag(a;rf)]+E[S-i]]“\ 


E[(K(crf, A^rf))t] = E[cd,t - -/V(i,t/2] 

= E[cd,t - {Nd - ^2 c<i,t')/2] 
t'<t 

=nuA + \Y.n':dA-'Y 

t'<t 

Nd 

Hcd,t] = '^E[Zn,d = t]. 

71=1 

The factor for Ud is not available in closed form. We have, 


log q{ud) =E^^, 2 jlogp(zrf|i/>rf,n;rf)] Tconst. 

= E^^,^JlogPG(n;d | N{cd), tj^d)] + const. 

Instead, following (Zhou et ah, 2012), we restrict the variational factor over u to take the form 
of a Polya-gamma distribution, ujd,t r\j PG{u}d,t I Ndd, tpdd)^ where Nd^ = [N{cd)]t. To perform the 
updates for ■0^, we only need the expectations of uJd,t under the Polya-gamma factors. The mean 
of PG{b, c) distribution is available in closed form; E^..^pQ(b,c)[w] = ^ tanh(|). Since the parameters 
of the Polya-gamma distribution have variational factors, we use iterated expectations and Monte 
Carlo methods to approximate the expectation. 




1 

2 

1 

2 

1 

2 


l^Zd [-^(C|l)i] ^ipd,t 


tanh(V'd,t/2) 


i’dd 


V i'=l 

^ t-1 Nd 


E, 


'tpdd 


tanh('0d,t/2) 


'ipd,i 


^Zd [^n,d ~ ^ ] ) ®bd,t 


t'=l n=l 


tanh(V;rf^f/2) 

i’dd 


The updates for the global topic distribution parameters, and S^, depend only on their normal 
inverse-Wishart prior and the expectations with respect to q{'il^d)- These follow their standard form, 
see, for example. Bishop (2006). 
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The variational updates for and (3^ are straightforward. 

"^ogqizn^d) = \^ogp{wn 4 I Zn 4 , Od, Wt})] + const. 


= ^0d,(3 
T 


+ const. 


^ ^;n,d(log „ + log Odd) 

t=l 

y^^Zn4i'Kf3[^og/3t,w„J +IE 0 jlog 6 »rf,t]) + const. 


t=i 


This implies that q{zn^d) is categorical with parameters, 


QiZfi^d) — Cat(z72,rf I ’^n,cz)? 

Un 4 ,t = ^exp{E/3[log^t,^„ J +E0^[log6lrf,i]} , 

T 

Z = ^exp{E/3[log/3t/,^^ J +E 0 jlog 6 lrf,i/]} 

t'=i 

The challenge is that Ee^pog^^^t] is not available in closed form. Instead we must approximate it 
by Monte Carlo sampling the corresponding value of i/j^. 

Last, 


log q{Pt) = E [logp(tD \ z,6) + logp{P^ I a)] + const. 

D Na V 

= EE ^[Zn 4 = t] log (3t,w„,d + '^{cuv - 1 ) ^OgPt^v + const. 

d=l n=l 11=1 

We recognize this as a Dirichlet distribution, 


D Nd 

9(/3i) = Dir(/3i I 5i), at,v = EE ll[Wn4 = '0]HZn4 = t] + Oly. 

d=l n=l 

The data local variables, Zn 4 , 'ipd^ nnd Ud, are conditionally independent across documents. 
Moreover, since the model is fully conjugate, the expectations required to update the global vari¬ 
ables, /3^, Hq, and depend on sufficient statistics that are derived from summmations over 
documents. Rather than summing over the entire corpus of documents, we can get an unbiased 
estimate of the sufficient statistics by considering a random subset, or mini-batch, per iteration. 
This is the key to stochastic variational inference (SVI) algorithms (Hoffman et ah, 2013), which 
have been widely successful in scalable topic modeling applications. Those same gains in scalability 
are readily applicable in this correlated topic model formulation. 




